Discriminability under Insensitivity: Joint Time-Frequency Scattering vs Time Scattering

This example contrasts the properties of Joint Time-Frequency Scattering (JTFS) and Wavelet Time Scattering (TS), describes some of their applications, and advises when one should be used over the other.
We show:
  1. Frequency-Dependent Time-Shift (FDTS) discriminability: JTFS is discriminative, while TS isn't.
  2. Frequency-shift insensitivity: TS is sensitive, while JTFS optionally isn't.
  3. Time-shift insensitivity: both are insensitive, and both can be invariant.

Introduction

Wavelet scattering is a deep convolutional network formed by cascade of wavelets, complex modulus nonlinearities, and lowpass filters. Unlike a neural network, it uses fixed weights, making it more interpretable and controllable. It yields representations that are time-shift insensitive, robust to noise, and stable against time-warping deformations, with uses in classification, synthesis, and other tasks.
Joint Time-Frequency Scattering extends wavelet time scattering by exploiting time-frequency structure, adding sensitivity to frequency-dependent time shifts and optional insensitivity to frequency transposition, while inheriting time scattering's aforementioned properties.
We contrast these transforms via illustrative examples, and supply real-world context. This example borrows from [1].

FDTS discriminability

A frequency-dependent time-shift (FDTS) is a shifting in time of frequency slices of a signal, where the shifting depends on frequency. One formulation is:
where X is the Continuous Wavelet Transform (CWT), but other time-frequency representations can also be used.
To illustrate, we take windowed pure sines, and shift them more with greater frequency. We use gen_fdts, controlled as follows:
The specific choice of parameters is tailored for illustrative purposes.
N = 4096; % signal length
n_partials = 5;
partials_f_sep = 1.7;
seg_len = floor(N/6);
f0 = floor(N/24);
% /10 and /8 are fine also, /9 was chosen for prettier time waveform
total_shift = floor(N/9);
 
[x, xs] = gen_fdts(N, n_partials, partials_f_sep, seg_len, f0, total_shift);
Next, we configure the scattering objects. Of relevance are the following parameters:
And, specific to JTFS:
Parameters which weren't described are specified for backwards compatibility.
% common parameters
T = 'global';
Q = 16;
r_psi = 0.96;
pad_mode = 'zero';
max_pad_factor = [];
analytic = false;
precision = 'double';
 
% JTFS-specific parameters
max_pad_factor_fr = [];
sampling_filters_fr = 'resample';
out_exclude = {'S0', 'S1', 'psi_t * phi_f', 'phi_t * psi_f', 'phi_t * phi_f'};
 
% in case of future changes
out_type = 'array';
Q_fr = 1;
out_3D = false;
average_fr = false;
 
% pack the parameters
cfg_ts = dictionary( ...
T={T}, Q={Q}, r_psi={r_psi}, pad_mode={pad_mode}, ...
max_pad_factor={max_pad_factor}, analytic={analytic}, precision={precision});
cfg_jtfs = dictionary( ...
Q_fr={Q_fr}, sampling_filters_fr={sampling_filters_fr}, ...
max_pad_factor_fr={max_pad_factor_fr}, out_3D={out_3D}, ...
average_fr={average_fr}, out_exclude={out_exclude});
% `update_dictionary` joins the entries of `cfg_ts` and `cfg_jtfs`
cfg_jtfs = update_dictionary(cfg_ts, cfg_jtfs);
Make scattering objects
ts = Scattering1D(N, kwargs=cfg_ts, out_type=out_type);
jtfs = TimeFrequencyScattering1D(N, kwargs=cfg_jtfs, out_type=out_type);
Make "CWT" by having Scattering1D only return first-order unaveraged coefficients
% oversampling: high to disable subsampling and enable 2D concatenation
% out_type: 'list' since average=false doesn't permit 'array'
cwt = Scattering1D(N, kwargs=cfg_ts, ...
average=false, out_type='list', oversampling=99, max_order=1);
Apply the transforms
ts_x = ts(x);
ts_xs = ts(xs);
jtfs_x = jtfs(x);
jtfs_xs = jtfs(xs);
cwt_x = cwt(x);
cwt_xs = cwt(xs);
Pack CWT into array, and exclude S0
% `2:end` to exclude `S0`
cwt_x = cellfun(@(c)c{'coef'}, cwt_x(2:end), 'UniformOutput', false);
cwt_x = cat(1, cwt_x{:});
cwt_xs = cellfun(@(c)c{'coef'}, cwt_xs(2:end), 'UniformOutput', false);
cwt_xs = cat(1, cwt_xs{:});
Compute and display distances
% relative L2, `||x0 - x1|| / ||x0||`
rl2_ts = rel_l2(ts_x, ts_xs);
rl2_jtfs = rel_l2(jtfs_x, jtfs_xs);
 
fprintf([ ...
'FDTS sensitivity (relative L2)\n' ...
'JTFS = %.2f\n' ...
'TS = %.2e\n' ...
'JTFS/TS = %.1f\n' ...
], rl2_jtfs, rl2_ts, rl2_jtfs / rl2_ts)
FDTS sensitivity (relative L2) JTFS = 0.33 TS = 4.15e-04 JTFS/TS = 803.0
Make and save GIF
scriptDir = fileparts(matlab.desktop.editor.getActive().Filename);
saveDir = scriptDir;
make_gif_fdts(saveDir, x, xs, ts_x, ts_xs, jtfs_x, jtfs_xs, cwt_x, cwt_xs)
Result:
jtfs_ts.gif
TS doesn't visibly change at all, while a fair portion of JTFS does. The relative distances,
are (repeated for convenience)
JTFS = 0.33
TS = 4.15e-04
JTFS/TS = 803.0
Concerning the ratio:

Frequency shift insensitivity

A "frequency shift", in context of scattering, refers to a log-frequency shift. Like a horizontal shift in CWT is a time shift, a vertical shift in CWT is a log-frequency shift.
Illustrating, we take windowed sines, and shift them in log-frequency. We reuse gen_fdts, applying it twice with different initial frequency f0, and discard the shifted output:
% relevant parameters
N = 4096;
f0_0 = N/12;
f0_1 = N/20;
% other parameters
n_partials = 4;
partials_f_sep = 1.6;
seg_len = floor(N/4);
 
[x0, ~] = gen_fdts(N, n_partials, partials_f_sep, seg_len, f0_0);
[x1, ~] = gen_fdts(N, n_partials, partials_f_sep, seg_len, f0_1);
Next, the scattering objects. Here we make CWT for visualization, and TS and JTFS for scattering. Parameter notes:
The other parameter choices' rationale follows that of the preceding section.
% common parameters
Q = 16;
r_psi = .96;
analytic = false;
pad_mode = 'zero';
 
% JTFS-specific parameters
average_fr = true;
oversampling_fr = 99;
max_pad_factor_fr = [];
pad_mode_fr = 'zero';
out_exclude = {'S0', 'S1'};
 
% pack
ckw = dictionary(Q={Q}, r_psi={r_psi}, analytic={analytic}, ...
pad_mode={pad_mode});
jkw = dictionary(average_fr={average_fr}, oversampling_fr={oversampling_fr}, ...
max_pad_factor_fr={max_pad_factor_fr}, ...
pad_mode_fr={pad_mode_fr}, out_exclude={out_exclude});
jkw = update_dictionary(jkw, ckw);
 
% make scattering/CWT objects
ts = Scattering1D(N, kwargs=ckw);
jtfs0 = TimeFrequencyScattering1D(N, kwargs=jkw, F=1e-16);
jtfs1 = TimeFrequencyScattering1D(N, kwargs=jkw, F=8);
jtfs2 = TimeFrequencyScattering1D(N, kwargs=jkw, F=32);
jtfs3 = TimeFrequencyScattering1D(N, kwargs=jkw, F=127.9);
cwt = Scattering1D(N, kwargs=ckw, average=false, out_type='list', ...
oversampling=99, max_order=1);
Turn the frequential lowpass filters into constants; refer to the helper method's docstring for an explanation.
makePhiFrExactlyGlobalAveraging(jtfs3)
Apply transforms
ts_x0 = ts(x0);
ts_x1 = ts(x1);
 
jtfs0_x0 = jtfs0(x0);
jtfs0_x1 = jtfs0(x1);
 
jtfs1_x0 = jtfs1(x0);
jtfs1_x1 = jtfs1(x1);
 
jtfs2_x0 = jtfs2(x0);
jtfs2_x1 = jtfs2(x1);
 
jtfs3_x0 = jtfs3(x0);
jtfs3_x1 = jtfs3(x1);
 
cwt_x0 = cwt(x0);
cwt_x1 = cwt(x1);
 
% pack CWT into array, and exclude `S0`
cwt_x0 = cellfun(@(c)c{'coef'}, cwt_x0(2:end), 'UniformOutput', false);
cwt_x1 = cellfun(@(c)c{'coef'}, cwt_x1(2:end), 'UniformOutput', false);
cwt_x0 = cat(1, cwt_x0{:});
cwt_x1 = cat(1, cwt_x1{:});
For accurate visualization of shift distances, second-order paths should be ordered n2-first (n2 being the index of a second-order "path"), while the library computes it n1-first. We reorder them here
[ts_x0, ts_x1] = reorder_S2(ts, ts_x0, ts_x1);
Compute and display distances
% relative L2, `||x0 - x1|| / ||x0||`
rl2_ts = rel_l2(ts_x0, ts_x1);
rl2_jtfs0 = rel_l2(jtfs0_x0, jtfs0_x1);
rl2_jtfs1 = rel_l2(jtfs1_x0, jtfs1_x1);
rl2_jtfs2 = rel_l2(jtfs2_x0, jtfs2_x1);
rl2_jtfs3 = rel_l2(jtfs3_x0, jtfs3_x1);
 
% print results
fprintf([ ...
'Frequency shift sensitivity, JTFS/TS:\n' ...
'%.4f -- F=0\n' ...
'%.4f -- F=8\n' ...
'%.4f -- F=32\n' ...
'%.4f -- F=''global''\n'], ...
rl2_jtfs0 / rl2_ts, rl2_jtfs1 / rl2_ts, ...
rl2_jtfs2 / rl2_ts, rl2_jtfs3 / rl2_ts)
Frequency shift sensitivity, JTFS/TS: 0.9321 -- F=0 0.5681 -- F=8 0.2142 -- F=32 0.0079 -- F='global'
Make and save GIF
saveDir = scriptDir;
make_gif_freq_tp(saveDir, x0, x1, cwt_x0, cwt_x1, ts_x0, ts_x1, ...
jtfs0_x0, jtfs0_x1, jtfs1_x0, jtfs1_x1, ...
jtfs2_x0, jtfs2_x1, jtfs3_x0, jtfs3_x1)
Result:
freq_tp.gif
Shift-insensitivity is applied analogously to time-shifts, with the following distinctions:
The JTFS-to-TS relative distance ratios are (repeated for convenience)
0.9321 -- F=0
0.5681 -- F=8
0.2142 -- F=32
0.0079 -- F='global'
For:

Example: exponential chirp

We further illustrate JTFS's FDTS-discriminability by inspecting its "spin assymetry" in response to an exponential chirp (echirp).
An echirp is the "ideal" signal, as it forms a (approx.) straight line in log-frequency. It can be formed (approx.) by shifting a delta (unit impulse), by a constant in time per a constant shift in log-frequency. Despite being remarkably different, time scattering cannot distinguish them [3] (1). We make an echirp, and show its scalogram.
First, configure scattering. For illustration, we show the pre-averaged coefficients. Except for the following, parameters follow same logic as in other sections, or are simply tailored for visualizing our specific signal:
% base parameters
N = 4096;
Q = 16;
J = 8;
T = 2^J;
average = false;
oversampling = 99;
max_pad_factor = [];
pad_mode = 'zero';
paths_exclude = dictionary(n1_fr={1}, j2={[1 2 3]});
out_type = 'dict:list';
 
% JTFS-specific parameters
J_fr = 4;
Q_fr = 1;
F = 2^J_fr;
average_fr = false;
out_3D = false;
oversampling_fr = 99;
max_pad_factor_fr = [];
pad_mode_fr = 'zero';
sampling_filters_fr = 'resample';
max_noncqt_fr = 0;
 
% pack
jkw = dictionary( ...
Q={Q}, J={J}, T={T}, average={average}, oversampling={oversampling}, ...
max_pad_factor={max_pad_factor}, pad_mode={pad_mode}, ...
paths_exclude={paths_exclude}, out_type={out_type}, ...
J_fr={J_fr}, Q_fr={Q_fr}, F={F}, average_fr={average_fr}, ...
out_3D={out_3D}, oversampling_fr={oversampling_fr}, ...
max_pad_factor_fr={max_pad_factor_fr}, pad_mode_fr={pad_mode_fr}, ...
sampling_filters_fr={sampling_filters_fr}, max_noncqt_fr={max_noncqt_fr} ...
);
% make JTFS object
jtfs = TimeFrequencyScattering1D(N, kwargs=jkw);
Generate the echirp. Span frequencies from low to Nyquist. fmin shouldn't be too low as it exacerbates low-frequency time warping and deviates from the straight-line idealization.
x = wavemat.toolkit.echirp(N, fmin=64, fmax=N/2);
Take JTFS
Scx = jtfs(x);
Compute and display energy ratio
up = cellfun(@(c)c{'coef'}, Scx{'psi_t * psi_f_up'}, 'UniformOutput', false);
dn = cellfun(@(c)c{'coef'}, Scx{'psi_t * psi_f_dn'}, 'UniformOutput', false);
up = cat(2, up{:});
dn = cat(2, dn{:});
E_up = sum(up.^2, 'all');
E_dn = sum(dn.^2, 'all');
 
fprintf('E_down / E_up = %.3g\n', E_dn / E_up)
E_down / E_up = 310
Make and save signal and scalogram plot. We fetch the scalogram from S1, since average=false.
cwt_x = Scx{'S1'};
saveDir = scriptDir;
make_scalogram(x, cwt_x, saveDir)
Make and save the 2D GIF visual.
make_gif_echirp2d(jtfs, Scx, saveDir)
Next, rebuild JTFS with time averaging and no excluded paths, and re-scatter. Use average=true so that our GIF has a manageable number of frames, and all paths so each 3D frame is richer.
jkw{'average'} = true;
jkw('oversampling') = [];
jkw('paths_exclude') = [];
jtfs = TimeFrequencyScattering1D(N, kwargs=jkw);
Scx_full = jtfs(x);
Make and save the GIF
preset = 'spinned';
baseName = 'echirp_jtfs3d';
saveDir = scriptDir;
wavemat.visuals.gif_jtfs_3d( ...
Scx_full, jtfs=jtfs, preset=preset, base_name=baseName, savedir=saveDir, ...
cmap_norm=.5, overwrite=1, verbose=0)
Result:
echirp_scalogram.png
Each JTFS joint coefficient is output from a 2D convolution (rather, cross-correlation) with a 2D wavelet. We show the wavelets, and their corresponding coefficients:
echirp_jtfs2d.gif
Positive xi1_fr is "spin up", which is "phase contour down" and corresponds to downward-sloped geometry in time-frequency. Negative xi1_fr is "spin down", opposite to up. Coefficients with xi1_fr=0 or xi2=0 are complementary "unspinned" pairs.
We observe strong response in spin down, and vanishing response in spin up. The ratio of their energies is
E_down / E_up = 310
which is an excellent signal to a classifier, and useful for audio synthesis and as a feature in general.
This "spin assymetry" is a core feature of JTFS. It is FDTS-sensitive asymmetry within the coefficients of a single transform, which is something out of reach for TS. JTFS accomplishes this by breaking TS's tree structure and operating jointly along time and frequency, while TS operates on one vector at a time, agnostic to structures of neighboring features.
To visualize the evolution of JTFS along time, we make a 3D heatmap, with each temporal slice capturing time-frequency intensities across three degrees of freedom: xi1, xi2, and xi1_fr (where xi1 is first-order scattering along time)
echirp_jtfs3d.gif
We observe the response over time as a smooth morphing from lower to higher xi1, with minimal variation along xi2 and xi1_fr. This is as expected, as variation along xi2 and xi1_fr implies amplitude modulation along time and frequency, respectively, over time, which doesn't happen for an echirp (3).
We also observe that the contents of opposing spin occur only for t near start and finish; this is reflective of sharp boundaries, i.e. lack of windowing. Indeed, the energy ratio can otherwise be driven much higher.

Disentangling variability

JTFS linearizes and disentangles core sources of signal variability: carrier frequency (frequency offset), modulation frequency (A.M.), and chirp rate γ(F.M.). We reproduce and explain this result from [2].
Denoting the triplet with , we have
,
which in order, reads window * modulator * carrier, and where
To make coefficients more comparable across different , we use a fixed w, making coefficients have the same effective (frequency) span, so the only differences are due to energies and (frequency) offsets of coefficients. We use . Note, some implementation specifics, and value of w, differ from that of [2].
Visualize a few :
x_all.png
We see that in each case, the total vertical span (bandwidth) is unchanged.
To compare auditory similarity retrieval in TS and JTFS, we utilize the Isomap algorithm for nonlinear dimensionality reduction, as in [2]. Isomap reduces dimensionality by, in rough terms, grouping points by their "true" distribution:
iso_demo.png
The plots show the "Swiss roll" dataset (left), and its isomap (right), approximately reproduced from the original isomap paper.
Quoting the source, "For two arbitrary points [large dots] on a nonlinear manifold, their Euclidean distance in the high-dimensional input space (length of dashed line) may not accurately reflect their intrinsic similarity, as measured by geodesic distance along the low-dimensional manifold (length of solid curve)." Isomap approximates manifold distances by accounting for progression of distances; it connects two points by hopping along points in-between, and minimizing total distance hopped. Isomap (the plot on right) minimizes the overall error of all such connections.
We build Isomaps for TS and JTFS, sweeping logarithmically with sweep size 16, forming a total of 4096 signals. With , we obtain the following Isomaps:
isomap_jtfs_w=1_Q=16-front.png
For TS, we see blue and red points packed close together. This means that signals with significantly different have similar time scatterings, which is undesired. Meanwhile, JTFS's isomap is 3-D, matching that of our 3-D parameter space, and points are distributed approximately linearly (in a cube shape) and uniformly, maintaining more even gaps for differences in parameter space.
TS reliably maps and γ (see [2]), forming a 2-D manifold, but fails with . The manifold being 2-D makes it incapable of mapping distances consistent with the parameter space, which is 3-D. If we nudge or γ, we nudge along one of the two dimensions in TS's 2-D manifold; if we nudge , it will also nudge along one of the two dimensions, making it entangled with the other parameters. JTFS disentangles all three parameters.
Why this difference? We know from [2] that TS doesn't struggle with (which lacks frequency resolution), so we hypothesize that higher Q degrades the similarity retrieval of amplitude modulations. What is the dependence, and is there a "tipping point" Q? We investigate by visualizing:
multianalysis_w=1_Q=.gif
TS's 3-D Isomaps gradually collapse into 2-D along the dimension. While there isn't a strict point at which the collapse is "finalized", the manifold is visually flat past around . Meanwhile, JTFS retains its locally and globally quasi-linear distribution.
To understand why this happens, we fixate at the highest visualized Q of 16:
multianalysis_w=1_Q=16.png
We observe, that the individual rows are highly similar in shapes. Following the GIF sweeping to , we see this is per progressively degrading time resolution, temporally expanding narrow modulation cycles. At the same time, the energies of columns (sum of squares along time slices) encode substantially different modulation rates. Indeed, while the overall first-order (unaveraged) transform fully encodes , it does so over multiple coefficients (rows); as TS operates on per-coefficient basis, it is agnostic to adjacent coefficients, and this spread-out encoding is not reflected at the second order. JTFS, on the other hand, operates jointly in time and frequency. Above the Isomap plots, we see the TS's coefficients are nearly same, while JTFS's have definite differences.
Lastly, we distinguish coefficient "morphing" behaviors by sweeping parameters in order:
ts_vs_jtfs_3d.gif
In this case, the transforms perform comparably. To show a difference, the entangled dimensions must be explored. As both transforms are frequency shift equivariant, both perform well with ; indeed, the well-separated "segments" for TS isomaps are due to . This leaves γ as the entangled dimension. As the transform with an extra degree of freedom and explicit sensitivity for FDTS, JTFS excels at mapping out γ, while TS relies on "surrogate" variability in individual coefficients. As we've seen earlier, JTFS is sensitive to substantial parameter differences, while TS can conflate them.
To illustrate the problem, we fix and γ, for two choices of γ: one where TS works well, and one where it doesn't. And, we sweep .
x_cwt_ts.gif
Let be the initial . We observe from coefficient differences, and relative Euclidean distances, that for one of the γ, both JTFS and TS consistently grow away from . For the other γ, however, while JTFS behaves as before, TS first quickly grows away from , then grows closer to it, despite still growing away from . This shows that TS distances do not closely correspond to parameter distances in .
To show that and γ are entangled, we consider what happens for a neighboring choice of γ relative to the poorly-performing γ. Suppose this neighbor is well-behaved, and distances always grow with growing . Since this isn't the case for the reference γ, however, points close to in the neighbor γ can be about equally close to in the reference γ.
This means that the points cannot be distributed along separate, equally-spaced 1D manifolds. One candidate, with just two γ considered, is that there are two 1D manifolds that share a point (e.g. ) and are angled with respect to it, such that additional γ would form a circle. As differing γ generates a distance (so many γ with same cannot intersect), this configuration doesn't occur; instead, the 1D manifolds, while separated, are tightly and irregularly packed, and sometimes (approximately) intersect. Hence, the 1D manifolds do not group into a well-behaved extra dimension, and there is entanglement: points with different or different γ can be close, and incrementing either doesn't consistently increment distances.
To visualize these observations, we compare a Isomap slices with our fixed :
iso_app_0.png
TS Isomap is compressed vertically (which, as we'll see, is the dimension), and points within it are irregularly spaced and irregularly ordered. Meanwhile, JTFS Isomap is roughly uniform both vertically and horizontally.
We now show Isomap sub-slices with fixed γs, specifically utilizing the poorly-behaved one from our example, and one of its neighbors:
iso_app_1.png
We see that for different γ, JTFS is distributed along two distinct, roughly equally spaced 1D manifolds, while TS is irregularly spread out in 2D. JTFS is well-ordered, while TS has instances of points such that, farther are closer than closer (blue points close to red points).
JTFS hence, while preserving frequency resolution, disentangles foundational signal building blocks: frequency offset, frequency modulation, and amplitude modulation.

Disentangling variability (extended)

Placeholder text for explanation
multianalysis_threeway_fm_w=1_Q=.gif
Placeholder text for explanation
multianalysis_threeway_gamma_w=1_Q=.gif
Placeholder text for explanation

Time shift insensitivity

Wavelet scattering's signature property, insensitivity to time shifts, , also carries to JTFS. To illustrate, we
  1. Tukey-window White Gaussian Noise of 4096 samples.
  2. Shift it by 256 samples.
  3. Apply TS and JTFS to the original and its shifting, using same T for both. For JTFS, apply it using multiple F, from 0 to 'global'.
  4. Compute relative distances between transforms of shifted and unshifted signals.
  5. Repeat 3-4 for all T, from 0 to 'global', swept in powers of 2 (0, 1, 2, 4, 8, ...).
Start with the signal
N = 4096;
shift = 256;
rng(0)
 
% make padded Tukey window
win = [zeros(1, N/4), tukeywin(N/2, 0.5).', zeros(1, N/4)];
% make windowed White Gaussian Noise
x0 = randn(1, N) .* win;
% shift signal
x1 = circshift(x0, shift);
Make common scattering configurations to be reused in T, F sweeps. As before, except for following parameters, parameter logic is shared with other sections:
% set jtfs's `F` sweep
F_all = {1e-16, 2, 8, 'global'};
 
% make common configs
Q = 16;
r_psi = 0.96;
pad_mode = 'zero';
max_pad_factor = [];
precision = 'double';
 
% make JTFS configs
out_exclude = {'S0', 'S1'};
pad_mode_fr = 'zero';
max_pad_factor_fr = [];
sampling_filters_fr = 'resample';
out_3D = true;
average_fr = true;
 
% pack configs
ckw = dictionary( ...
Q={Q}, r_psi={r_psi}, pad_mode={pad_mode}, ...
max_pad_factor={max_pad_factor}, precision={precision});
jkw = dictionary( ...
out_exclude={out_exclude}, max_pad_factor_fr={max_pad_factor_fr}, ...
pad_mode_fr={pad_mode_fr}, sampling_filters_fr={sampling_filters_fr}, ...
out_3D={out_3D}, average_fr={average_fr});
Initialize loop variables
% containers
rl2_ts_all = dictionary();
rl2_jtfs_all = dictionary();
for F=F_all
rl2_jtfs_all{F(1)} = dictionary();
end
% dyadic scale of `N`, for sweeping
log2_N = floor(log2(N));
Iterate
for log2_T=0:log2_N
% T from log2_T.
% Internally, `T=2^log2_N` acts same as `T='global'` (in general, it's
% rather if `T=2^ceil(log2(N))`, but here that's same as our `log2_N`). See
% `T` docs for an explanation of (and caveats to) this design choice.
T = 2^log2_T;
% make TS
ckw{'T'} = T;
if log2_T == log2_N
ckw{'oversampling'} = 99;
end
ts = Scattering1D(N, kwargs=ckw);
if canUseGPU
ts.gpu()
end
 
% apply TS
ts_x0 = ts(x0);
ts_x1 = ts(x1);
% exclude `S0`
ts_x0 = ts_x0(:, 2:end, :);
ts_x1 = ts_x1(:, 2:end, :);
% compute distance
rl2_ts_all(log2_T) = gather(rel_l2(ts_x0, ts_x1));
 
% sweep `F`
for k=1:numel(F_all)
F = F_all{k};
% make JTFS
jkw = update_dictionary(jkw, ckw);
jkw{'F'} = F;
jtfs = TimeFrequencyScattering1D(N, kwargs=jkw);
if canUseGPU
jtfs.gpu()
end
 
% apply JTFS
jtfs_x0 = jtfs(x0);
jtfs_x1 = jtfs(x1);
% select joint pairs (in `{2}` per `out_3D=true` with `out_type='array'`)
jtfs_x0 = jtfs_x0{2};
jtfs_x1 = jtfs_x1{2};
% compute distance
rl2_jtfs_all{{F}}(log2_T) = gather(rel_l2(jtfs_x0, jtfs_x1));
end
end
Make and save plots
saveDir = scriptDir;
plot_T_sweep(rl2_ts_all, rl2_jtfs_all, F_all, saveDir)
Print distances
F1 = F_all{1};
F2 = F_all{2};
F3 = F_all{3};
F4 = F_all{4};
fprintf(['Relative L2 for T=''global'':\n' ...
'%.2e -- TS\n' ...
'%.2e -- JTFS, F=%d\n' ...
'%.2e -- JTFS, F=%d\n' ...
'%.2e -- JTFS, F=%d\n' ...
'%.2e -- JTFS, F=%s\n'], ...
rl2_ts_all(log2_N), ...
rl2_jtfs_all{{F1}}(log2_N), 0, ...
rl2_jtfs_all{{F2}}(log2_N), F2, ...
rl2_jtfs_all{{F3}}(log2_N), F3, ...
rl2_jtfs_all{{F4}}(log2_N), "'global'")
Relative L2 for T='global': 1.30e-16 -- TS 5.68e-16 -- JTFS, F=0 5.75e-16 -- JTFS, F=2 7.89e-16 -- JTFS, F=8 6.48e-17 -- JTFS, F='global'
Result:
T_sweep.png
A clear and monotonic dependence is observed for distance versus T, and versus F. For T='global', where we expect invariance, i.e. zero distance, we obtain (repeated for convenience)
1.30e-16 -- TS
5.68e-16 -- JTFS, F=0
5.75e-16 -- JTFS, F=2
7.89e-16 -- JTFS, F=8
6.48e-17 -- JTFS, F='global'
which is within float precision of zero. Note how, unlike with F='global', invariance with time shifts is faithfully reproduced, as unaveraged CWT (and by extension scattering) is exactly equivariant to time shifts, but not to frequency shifts.
For JTFS, we observe that higher F yields lower time-shift distances. Indeed, frequency averaging also has a time averaging effect, which is converse of the result shown in [9] that time averaging has a frequency averaging effect. In simple terms,
We also observe that for some T, the JTFS distances with F=0 slightly exceed that of TS. This is a legitimate effect and not a numeric artifact. In short, this isn't a concern in practice and insensitivity works as expected; refer to (5) for further discussion.

Warp-stability, noise-robustness

We state known properties of the scattering transform, without illustrating:
JTFS inherits noise robustness, since it inherits contractiveness. Contractiveness is inherited since it is a direct consequence of energy non-expansiveness of wavelet filterbanks (by design), and of the complex modulus operator [9].
JTFS also inherits warp-stability, but it's not identical to TS's. That is, following the earlier observation in the "Frequency shift insensitivty" section on distances with F=0, JTFS distances aren't required to follow TS distances, and can exceed them. Warp stability proofs in [8] are for TS only, and while distance behaviors are expected to agree closely in general, the exact relationship is unknown (to best of author's knowledge).

Applications of JTFS properties

  1. Audio classification: strong performance was shown in a wide variety of tasks [2] [3] [4] [5].
  2. Biomedical classification: strong performance was shown in [10], but existing research is limited.
  3. Audio synthesis: advantages over TS for audio texture synthesis, in better preserving alignment of partials upon high T, are shown in [2], [3]. [6] showed how spinned coefficients can be used to locally time-reverse a signal. [2], [5] showed that JTFS accurately models auditory similarities, and [7] additionally showed how JTFS, unlike the spectrogram or scalogram, is viable for mesoscale parameter retrieval.
Properties utilized:

JTFS vs TS: when to use which?

Synthesis: it's matter of what signal parameters one wishes to control:
Classification: the mainstream advice is, use TS when
We note, however, that the amount of published work involving JTFS is limited, especially that which analyzes its classification success. Without publishing, the original author of wavemat and this article has studied the matter in depth, and advises the following:

Extended notes

(1) Strictly speaking, it can, but it would take a great number of orders, more than in virtually any application. TS is only invariant to two kinds of transformations: a global phase shift, and a global time shift if T='global'. Yet, at the same time, due to its tree structure, TS cannot distinguish between and any arbitrary global time-shifting of individual its frequencies λ, including if said shifts are different. What keeps TS discriminative is the fact that such transformations are impossible: there aren't x that yield such X. This includes the exponential chirp formed from a unit impulse: such a chirp would need to be correlated with each frequency λ at only one time instant, which is impossible. The power of JTFS lies in achieving this discriminability with only two orders.
(2) The idea is energy conservation, which in turn ensures that features aren't over- (or under-) represented. F<2^J_fr, for example, would raise redundancy (create information overlap) of psi_t * phi_f and phi_t * phi_f pairs relative to unaveraged spinned pairs, and their energies would exceed conserving amounts. T<2^J has this effect in terms of orders, making higher orders recover more information than was lost in previous. Since our max_order<inf, and spinned can be averaged, both can be beneficial, but not for accurately representing each order (as a decomposition; and, in this case, we require average_fr=false).
(3) Strictly speaking, it is variation in time and log-frequency slices of , over time, which does happen, due to the wavelets varying in time support and bandwidth as a function of xi1. This variation is small for small sweeps across xi1.
(4) In "idealized" sense, which matters for direct comparisons against TS, as TS doesn't incur unpad losses along frequency. Depending on context, this can manifest as overestimates or underestimates of a given metric. In this case, the difference against out_3D=false is barely perceptible, but we use true as good practice. In other sections, this was deemed less important, or more code-inconvenient (out_3D=true requires average_fr=true, and changes output handling syntax with out_type='array'). Note, there's also time unpad losses; in short, all results (in all sections) remain faithful to intent, but for details, see (8).
(5) Spinned coefficients can significantly increase or decrease time-shift sensitivity relative to time scattering, depending on input signal. A decrease occurs per spatial smoothing along frequency axis, due to frequential scattering. An increase occurs per the scalogram being effectively "split up" between the two spins, lowering per-spin coefficient overlap upon a time shift, and driving up distance. The increase can persist, and remain significant, even with F='global' in extreme (and impractical) cases. In practice, we expect sensitivity to behave approximately as in our plot, especially for high T. (Note: due to time unpad losses (see (4)), the gap between F=0 and TS in the plot is exaggerated, though said gap is still small; also see (9).)
(6) While [9] states the result for , [8] proves it only for . Note, "size" of deformation includes and , respectively, the gradient of deformation, and size of deformation (with J dependence). [8] states that can be replaced with if the additional dependence of is introduced to the bounding distance. Yet, as , this term explodes, making the bound non-existent. For not too close to 1, we still see some approximate stability, but the scattering distance bound is no longer linearly proportional to size of deformation.
(7) Ratio of spinned coefficients' energy to non-spinned joint energy ("joint" = non-S0, S1), and, relative Euclidean distance between unaveraged (in time and frequency) spin up and down coefficients, should both be sufficiently large. So, E_spinned / E_nonspinned, and rel_l2(up, dn). From limited experimentation, "sufficiently large" are, respectively, 0.1 and 0.5. For the first metric, T = 2^J and F = 2^J_fr must be used (and default J and J_fr work well), and Q, r_psi should match what will actually be used.
(8) Any T='global' isn't affected at all, as there isn't unpadding. The frequency transposition example is negligibly affected. In the time-shift insensitivity example, the difference in plots is perceptible, but small. (The explanation for the two aforementioned cases concerns effective signal support given windowing, hence the amount of unpad losses, and whether time shifting is involved, which also affects losses.) The echirp example's ratio does change substantially (in testing, from to ), which is as expected as the boundary effect is exacerbated (more of it is kept), but a more realistic chirp example (and incidentally, more idealized) involves windowing, which'd greatly reduce this effect; windowing was omitted to illustrate this effect.
(9) The precise curves expected for F=0 vs TS for White Gaussian Noise weren't studied, but by reimplementing JTFS to include more scattering paths, observed were much fewer instances of F=0 distances exceeding TS's, and a greater (though still small) gap of TS exceeding F=0 for small T. This was only tested for the given shift, Q, and r_psi. Note that these differences against what's obtained in this example aren't practically significant, they're only for if we wish to be very precise (or to know if F=0 exceeding TS always holds for WGN).

References

[1] J. Muradeli. "Joint Time-Frequency Scattering explanation", Stack Exchange, 2021. https://dsp.stackexchange.com/a/78623/50076
[2] J. Muradeli, C. Vahidi, C. Wang, H. Han, V. Lostanlen, M. Lagrange, G. Fazekas. "Differentiable Time-Frequency Scattering on GPU", in DAFx20in22, 2022. https://dafx2020.mdw.ac.at/proceedings/papers/DAFx20in22_paper_25.pdf
[3] J. Andén, V. Lostanlen, S. Mallat. "Joint Time-Frequency Scattering", in IEEE Transactions on Signal Processing, vol. 67, no. 14, pp. 3704-3718, 2019. doi: 10.1109/TSP.2019.2918992. https://ieeexplore.ieee.org/document/8721532
[4] J. Andén, V. Lostanlen, S. Mallat. "Joint time-frequency scattering for audio classification", in IEEE 25th International Workshop on Machine Learning for Signal Processing, 2015. doi: 10.1109/MLSP.2015.7324385. https://ieeexplore.ieee.org/document/7324385
[5] V. Lostanlen, C. El-Hajj, M. Rossignol, G. Lafay, J. Andén, M. Lagrange. "Time–frequency scattering accurately models auditory similarities between instrumental playing techniques", in EURASIP J Audio Speech Music Process, 2021. doi: 10.1186/s13636-020-00187-z. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7801324/
[6] V. Lostanlen, F. Hecker. "The Shape of RemiXXXes to Come: Audio Texture Synthesis with Time-frequency Scattering", in arXiv, 2019. doi: 10.48550/arXiv.1906.09334. https://arxiv.org/abs/1906.09334
[7] C. Vahidi, H. Han, C. Wang, M. Lagrange, G. Fazekas, V. Lostanlen. "Mesostructures: Beyond Spectrogram Loss in Differentiable Time-Frequency Analysis". Journal of the Audio Engineering Society, 2023, 71 (9), pp.577-585. hal-04118474. https://hal.science/hal-04118474/document
[8] S. Mallat. "Group Invariant Scattering", in Communications on Pure and Applied Mathematics, vol. 65, no. 10, pp. 1331-1398, 2012. doi: 10.1002/cpa.21413. https://onlinelibrary.wiley.com/doi/abs/10.1002/cpa.21413
[9] J. Andén, S. Mallat. "Deep Scattering Spectrum", in IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114-4128, 2014. doi: 10.1109/TSP.2014.2328595. https://ieeexplore.ieee.org/document/6822556
[10] J. Muradeli. "Sleep Stage Classification using Joint Time-Frequency Scattering", 2024. MathWorks URL